Data Statement Chunking

ABSTRACT

Techniques are presented for applying fine-grained client-specific rules to divide (e.g., chunk) data statements to achieve cost reduction and/or failure rate reduction associated with executing the data statements over a subject dataset. Data statements for the subject dataset are received from a client. Statement attributes derived from the data statements are processed with respect to fine-grained rules and/or other client-specific data to determine whether a data statement chunking scheme is to be applied to the data statements. If a data statement chunking scheme is to be applied, further analysis is performed to select a data statement chunking scheme. A set of data operations are generated based at least in part on the selected data statement chunking scheme. The data operations are issued for execution over the subject dataset. The results from the data operations are consolidated in accordance with the selected data statement chunking scheme and returned to the client.

RELATED APPLICATIONS

This application is a continuation of U.S. Pat. Application Ser. No. 15/836,836 titled, “DATA STATEMENT CHUNKING” (Attorney Docket No. ATSC-P0010-10-US), filed Dec. 9, 2017, the entire teachings of which is incorporated by reference in its entirety.

FIELD

This disclosure relates to data analytics, and more particularly to techniques for data statement chunking.

BACKGROUND

The increasing volume, velocity, and variety of information assets (e.g., data) drive the design and implementation of modern data storage environments. While all three components of data management are growing, the volume of data often has the most direct impact on costs incurred by a data management client (e.g., user, enterprise, process, etc.). As an example, a data analyst from a particular enterprise that is issuing data statements (e.g., queries) on a large dataset from a business intelligence (BI) application will incur costs related to the processing (e.g., CPU) resources consumed to execute the data operations invoked by such data statements. Specifically, executing an SQL aggregation query with high cardinality (e.g., due to the number of dimensions in the GROUP BY clause, etc.) over a large dataset (e.g., millions of rows) can incur significant processing costs.

With today’s distributed and/or cloud-based storage environments, direct expenditures associated with using various computing networks and/or accessing certain storage facilities (e.g., egress costs, etc.) can also be incurred. The “cost” of human resources consumed while a user waits for a query to execute over a large dataset can also be significant. In some cases, the probability that a query might fail increases commensurately with the size of the dataset. If a query fails, then the foregoing costs have been incurred in vain since no query results are produced.

Unfortunately, legacy techniques for managing the foregoing cost and/or failure characteristics of data operations over large datasets have limitations. One legacy approach relies on a query engine associated with the subject dataset to execute the data operations in a manner that achieves certain objectives. The objectives considered by such query engines, however, are deficient in addressing the aforementioned cost and/or failure challenges associated with data operations over large datasets. For example, query engines might execute data operations in accordance with broad-based rules pertaining to resource loading and/or resource balancing rather than in consideration of incurred resource costs.

In addition to the limited rules available to the query engine, the corpus of information available at the query engine to apply to data operation execution decisions is also limited. Specifically, the query engine does not have access to certain data (e.g., statistical data, behavioral data, etc.) associated with the client (e.g., user, enterprise, process, etc.) that is required to address the data operation cost and/or failure challenges that are specific to the client. What is needed is a technological solution that reduces the client-specific costs and/or failure rates associated with performing data operations over large datasets.

Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by their occurrence in this section.

SUMMARY

The present disclosure describes techniques used in systems, methods, and in computer program products for data statement chunking, which techniques advance the relevant technologies to address technological issues with legacy approaches. More specifically, the present disclosure describes techniques used in systems, methods, and in computer program products for rule-based chunking of data statements for operation over large datasets in data storage environments. Certain embodiments are directed to technological solutions for evaluating client-specific rules and/or data to chunk data statements into data operations that facilitate cost reduction and/or failure rate reduction associated with executing the data statements over large datasets in a data storage environment.

The disclosed embodiments modify and improve over legacy approaches. In particular, the herein-disclosed techniques provide technical solutions that address the technical problems attendant to reducing the client-specific costs and/or failure rates associated with data operations that are performed over large datasets. Such technical solutions relate to improvements in computer functionality. Various applications of the herein-disclosed improvements in computer functionality serve to reduce the demand for computer memory, reduce the demand for computer processing power, reduce network bandwidth use, and reduce the demand for inter-component communication. Some embodiments disclosed herein use techniques to improve the functioning of multiple systems within the disclosed environments, and some embodiments advance peripheral technical fields as well. As one specific example, use of the disclosed techniques and devices within the shown environments as depicted in the figures provides advances in the technical field of database systems as well as advances in various technical fields related to processing heterogeneously structured data.

Further details of aspects, objectives, and advantages of the technological embodiments are described herein and in the drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present disclosure.

FIG. 1 presents a diagram that depicts several implementation techniques pertaining to rule-based chunking of data statements for operation over large datasets in data storage environments, according to some embodiments.

FIG. 2 depicts a data statement chunking technique as implemented in systems that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments, according to an embodiment.

FIG. 3 shows a computing environment comprising a data analytics engine that facilitates rule-based chunking of data statements for operation over large datasets in data storage environments, according to an embodiment.

FIG. 4A and FIG. 4B illustrate a data statement chunking scheme selection technique as implemented in systems that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments, according to an embodiment.

FIG. 5A and FIG. 5B present a data operations generation technique as implemented in systems that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments, according to an embodiment.

FIG. 6 depicts system components as arrangements of computing modules that are interconnected so as to implement certain of the herein-disclosed embodiments.

FIG. 7A and FIG. 7B present block diagrams of computer system architectures having components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments.

DETAILED DESCRIPTION

Embodiments in accordance with the present disclosure address the problem of reducing the client-specific costs and/or failure rates associated with data operations that are performed over large datasets. Some embodiments are directed to approaches for evaluating client-specific rules and/or data to chunk data statements into data operations that facilitate cost reduction and/or failure rate reduction associated with executing the data statements over large datasets in a data storage environment. The accompanying figures and discussions herein present example environments, systems, methods, and computer program products for rule-based chunking of data statements for operation over large datasets in data storage environments.

Overview

Disclosed herein are techniques that apply fine-grained client-specific rules to divide (e.g., chunk) data statements so as to achieve cost reduction and/or failure rate reduction associated with executing the data statements over a subject dataset in a data storage environment. In certain embodiments, the data statements for a subject dataset are received from a client (e.g., a user, a process, an enterprise, etc.). The data statements are analyzed to derive a set of statement attributes associated with the data statements. The statement attributes are exposed to the fine-grained rules and/or other client-specific data to determine whether a data statement chunking scheme is to be applied to the data statements. If a data statement chunking scheme is to be applied, further analysis is performed to select a data statement chunking scheme. A set of data operations are generated based at least in part on the selected data statement chunking scheme. The data operations are issued to a query engine for execution over the subject dataset. The results from the data operations are consolidated in accordance with the selected data statement chunking scheme and returned to the client. In certain embodiments, the client-specific rules and/or other client-specific data accessed to evaluate the rules are inaccessible by the query engine. In certain embodiments, the client-specific data comprise expanded dataset metadata (e.g., semantic information) that corresponds to the subject dataset.

Definitions and Use of Figures

Some of the terms used in this description are defined below for easy reference. The presented terms and their respective definitions are not rigidly restricted to these definitions; a term may be further defined by the term’s use within this disclosure. The term “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application and the appended claims, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or is clear from the context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A, X employs B, or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. As used herein, at least one of A or B means at least one of A, or at least one of B, or at least one of both A and B. In other words, this phrase is disjunctive. The articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or is clear from the context to be directed to a singular form.

Various embodiments are described herein with reference to the figures. It should be noted that the figures are not necessarily drawn to scale and that elements of similar structures or functions are sometimes represented by like reference characters throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the disclosed embodiments; they are not representative of an exhaustive treatment of all possible embodiments, and they are not intended to impute any limitation as to the scope of the claims. In addition, an illustrated embodiment need not portray all aspects or advantages of usage in any particular environment.

An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiments even if not so illustrated. References throughout this specification to “some embodiments” or “other embodiments” refer to a particular feature, structure, material or characteristic described in connection with the embodiments as being included in at least one embodiment. Thus, the appearance of the phrases “in some embodiments” or “in other embodiments” in various places throughout this specification are not necessarily referring to the same embodiment or embodiments. The disclosed embodiments are not intended to be limiting of the claims.

Descriptions of Example Embodiments

FIG. 1 presents a diagram 100 that depicts several implementation techniques pertaining to rule-based chunking of data statements for operation over large datasets in data storage environments. As an option, one or more variations of diagram 100 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The diagram 100 or any aspect thereof may be implemented in any environment.

The diagram shown in FIG. 1 is merely one example representation of the herein disclosed techniques that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments. More specifically, and as shown in FIG. 1, the herein disclosed techniques facilitate chunking of data statements based at least in part on a set of client-specific data 114 (e.g., statement chunking rules, etc.) accessible in a client data statement processing layer 110. The client-specific data (e.g., fine-grained client-specific rules, etc.) are applied to divide (e.g., chunk) the data statements so as to facilitate cost reduction and/or failure rate reduction associated with executing the data statements over a subject dataset 146 in a data storage environment 140. The techniques disclosed herein address the problems attendant to managing the costs and/or failure rates associated with data operations that are performed over the subject dataset 146, particularly as the subject dataset 146 increases in size (e.g., number of input rows, number of result rows, etc.).

In the shown embodiment, the data statements for the subject dataset 146 are received from a client 102 (e.g., one or more users 104, one or more processes 106, etc.) at a data statement chunking agent 112 in the client data statement processing layer 110 (operation 1). The client-specific data 114 is consulted to determine a chunking scheme for the data statements (operation 2). As an example, the client-specific data 114 might comprise statement chunking rules from the client 102. The client-specific data 114 might further comprise performance data associated with data statements earlier processed in the client data statement processing layer 110. The client-specific data 114 might also comprise metadata describing certain characteristics (e.g., one or more data models) associated with the subject dataset 146 that are unique to the client 102 and/or to the client data statement processing layer 110. Other information may also be included in the client-specific data 114.

When the chunking scheme is determined, a data statement processing agent 116 executes one or more data operations over the subject dataset 146 in accordance with the chunking scheme (operation 3). A result set produced by the executed data operations is returned to the client 102 (operation 4). For example, if the chunking scheme calls for a single issued data statement to be chunked into N data statements, data operations to carry out the N data statements are generated and executed to return a result set in response to the single issued data statement.

An embodiment of the herein disclosed techniques as implemented in a data statement chunking technique is shown and described as pertains to FIG. 2.

FIG. 2 depicts a data statement chunking technique 200 as implemented in systems that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments. As an option, one or more variations of data statement chunking technique 200 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The data statement chunking technique 200 or any aspect thereof may be implemented in any environment.

The data statement chunking technique 200 presents one embodiment of certain steps and/or operations that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments. As shown, the data statement chunking technique 200 can commence by receiving one or more data statements from a client to operate over a subject dataset (step 230). A chunking scheme for the data statements is determined based at least in part on a set of client-specific data (step 240). For example, the client-specific data might comprise a set of statement chunking rules, a set of performance data, a set of expanded dataset metadata, and/or other data specific to a client environment. One or more data operations corresponding to the data statements are generated based at least in part on the chunking scheme (step 250). The data operations are executed (step 260), and the results from the data operations are merged (step 270) into a result set that is returned to the client (step 280).
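The flow of step 230 through step 280 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the function names, the dictionary layout of the statement and of the client-specific data, the in-memory list that stands in for the subject dataset, and the simple summing routine that stands in for a query engine are hypothetical and do not appear in the disclosure. The merge step shown is valid only for additive measures.

```python
from collections import defaultdict

def determine_scheme(statement, client_data):
    """Step 240: decide whether and how to chunk, using client-specific data."""
    if client_data["estimated_input_rows"] <= client_data["max_input_rows"]:
        return None  # threshold not breached, so no chunking
    return {"dimension": client_data["chunk_dimension"],
            "values": client_data["chunk_values"]}

def generate_operations(statement, scheme):
    """Step 250: one operation per chunk, or a single unchunked operation."""
    if scheme is None:
        return [dict(statement, chunk_filter=None)]
    return [dict(statement, chunk_filter=(scheme["dimension"], value))
            for value in scheme["values"]]

def execute(operation, rows):
    """Step 260: stand-in for a query engine; sums a measure per group."""
    totals = defaultdict(int)
    chunk_filter = operation["chunk_filter"]
    for row in rows:
        if chunk_filter is None or row[chunk_filter[0]] == chunk_filter[1]:
            totals[row[operation["group_by"]]] += row[operation["measure"]]
    return dict(totals)

def merge(results):
    """Step 270: consolidate per-chunk results (additive measures only)."""
    merged = defaultdict(int)
    for partial in results:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

def process(statement, client_data, rows):
    """Steps 230 through 280 for one received statement."""
    scheme = determine_scheme(statement, client_data)
    operations = generate_operations(statement, scheme)
    return merge(execute(op, rows) for op in operations)

# Hypothetical sample data standing in for the subject dataset.
rows = [
    {"gender": "F", "region": "east", "amount": 10},
    {"gender": "M", "region": "east", "amount": 5},
    {"gender": "F", "region": "west", "amount": 7},
    {"gender": "M", "region": "west", "amount": 3},
]
statement = {"group_by": "region", "measure": "amount"}
client_data = {"estimated_input_rows": 4, "max_input_rows": 2,
               "chunk_dimension": "gender", "chunk_values": ["F", "M"]}
result = process(statement, client_data, rows)
```

In this sketch the chunked and unchunked paths return the same totals, which is the property a merge step would be expected to preserve.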

A detailed embodiment of a system and data flows that implement the techniques disclosed herein is presented and discussed as pertains to FIG. 3.

FIG. 3 shows a computing environment 300 comprising a data analytics engine that facilitates rule-based chunking of data statements for operation over large datasets in data storage environments. As an option, one or more variations of computing environment 300 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein.

As shown in the embodiment of FIG. 3, data statement chunking agent 112 is implemented in a data analytics engine 310 to facilitate rule-based chunking of data statements 322 issued for operation over subject dataset 146 stored in a storage pool 344 in data storage environment 140. In this embodiment, data analytics engine 310 serves to establish the client data statement processing layer 110 earlier described. As can be observed, a planning agent 312 at data analytics engine 310 receives the data statements 322 from client 102 (e.g., one or more users 104, one or more processes 106, etc.). For example, the data statements 322 might be issued from a data analysis application (e.g., business intelligence tool) managed by one of the users 104 (e.g., a data analyst). As another example, the data statements 322 might be issued by a process that calculates statistics, or a process that generates aggregated data. The planning agent 312 analyzes the data statements 322 to determine one or more statement attributes 324 associated with the data statements 322. As an example, the statement attributes 324 might describe the structure and/or constituents (e.g., clauses, predicates, expressions, etc.) of the data statements 322.

The data statement chunking agent 112 applies certain portions of the client-specific data 114 accessible by the data analytics engine 310 in the client data statement processing layer 110 to the statement attributes 324 to determine a chunking scheme 326₁ (if any) for the data statements 322. As an example, the client-specific data 114 might comprise statement chunking rules 352 from the client 102 that are used to determine whether the data statements are to be chunked. Specifically, the statement chunking rules 352 might establish a set of logic that will invoke chunking operations if a certain threshold for a maximum allowed number of query input rows is breached. In some cases, a set of performance data 354 from the client-specific data 114 is consulted to determine one or more performance estimates (e.g., estimated processing cost, estimated processing time, etc.) for the data statements 322.

As shown, the performance data 354 may be derived by an execution agent 314 from data statements earlier processed at the data analytics engine 310. One or more of these performance estimates might be exposed to the statement chunking rules 352 to determine whether to invoke chunking operations. In some cases, certain portions of the performance data 354 may be derived from dataset statistics 355 presented by one or more components of the data storage environment 140. For example, the filesystem of the data storage environment 140 might provide information as to the size of the files comprising the subject dataset 146. As another example, one or more of the query engines 342 at the data storage environment 140 might present certain statistics and/or information pertaining to the subject dataset 146 (e.g., row counts, histogram information, number of distinct values, number of null values, etc.) that might be useful in determining whether to invoke chunking operations and/or in determining a chunking scheme.

If the data statements are to be chunked, a set of expanded dataset metadata 358 from the client-specific data 114 might be accessed to facilitate selection of the chunking scheme 326₁. For example, the expanded dataset metadata 358 can be accessed to determine one or more dimensions associated with the subject dataset 146 that might serve as a chunking dimension. As can be observed in FIG. 3, the expanded dataset metadata 358 is derived at least in part from a set of dataset metadata 356 by a data model manager 320 at the data analytics engine 310. The data model manager 320 might facilitate the generation of relationships and/or other characteristics associated with the subject dataset 146 that are included in the expanded dataset metadata 358, but are not included in the dataset metadata 356 from the data storage environment 140. In some cases, the expanded dataset metadata 358 is generated based at least in part on input (e.g., data model design specifications, etc.) from client 102.

The chunking scheme 326₁ determined by the data statement chunking agent 112 is received by the planning agent 312 to generate a statement chunking plan 328 that is delivered to the execution agent 314. A scheduler 316 at the execution agent 314 issues one or more data operations 332 to one or more query engines 342 at the data storage environment 140 in accordance with the statement chunking plan 328. In some cases, the data operations 332 comprise execution directives that control the execution of the data operations 332 at the execution agent 314 and/or the data storage environment 140. As an example, an execution directive might indicate that the results 334 from the data operations 332 issued to the query engines 342 are to be merged by a result processor 318 at the execution agent 314 into a result set 336. The result set 336 can then be accessed by client 102.

As can be concluded from the foregoing discussion pertaining to FIG. 3 and/or other discussions herein, data statement chunking in accordance with the herein disclosed techniques is facilitated in environments with access to the client-specific data 114, at least as described and/or implemented herein. As an example, while such data statement chunking can be implemented at the data analytics engine 310 in the client data statement processing layer 110 of FIG. 3, data statement chunking according to the herein disclosed techniques is not performed in other environments that lack access to the client-specific data 114. Specifically, a lack of access to the client-specific data 114 by the query engines 342 in data storage environment 140 precludes the performance of data statement chunking according to the herein disclosed techniques at the query engines 342.

Further details pertaining to selecting a data statement chunking scheme according to the herein disclosed techniques are shown and described as pertains to FIG. 4A and FIG. 4B.

FIG. 4A and FIG. 4B illustrate a data statement chunking scheme selection technique 400 as implemented in systems that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments. As an option, one or more variations of data statement chunking scheme selection technique 400 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The data statement chunking scheme selection technique 400 or any aspect thereof may be implemented in any environment.

The data statement chunking scheme selection technique 400 presents one embodiment of certain steps and/or operations that select a chunking scheme when performing rule-based chunking of data statements for operation over large datasets in data storage environments, according to the herein disclosed techniques. Various illustrations are also presented to illustrate the data statement chunking scheme selection technique 400. Further, specialized data structures designed to improve the way a computer stores and retrieves data in memory when performing steps and/or operations pertaining to data statement chunking scheme selection technique 400 are also shown in FIG. 4A and FIG. 4B.

As shown, the data statement chunking scheme selection technique 400 can commence by analyzing one or more data statements issued for operation over a subject dataset to determine a set of corresponding statement attributes (step 402). For example, various statement attributes corresponding to the select statement attribute identifiers 454 might be derived from the representative data statement 452 issued for operation over a “sales_fact_table” dataset. The statement attributes and/or any other data described herein can be organized and/or stored using various techniques.

For example, the select statement attribute identifiers 454 indicate that the statement attributes might be organized and/or stored in a tabular structure (e.g., relational database table), which has rows that relate various statement attributes with a particular data statement. As another example, the information might be organized and/or stored in a programming code object that has instances corresponding to a particular data statement and properties corresponding to the various attributes associated with the data statement. Specifically, and as depicted in select statement attribute identifiers 454, a data record (e.g., table row or object instance) for a particular data statement might describe a statement identifier (e.g., stored in a “statement ID” field), a user identifier (e.g., stored in a “userID” field), a user role identifier (e.g., stored in a “userRole” field), a client identifier (e.g., stored in a “clientID” field), a data model identifier (e.g., stored in a “modelID” field), a timestamp (e.g., stored in a “time” field), a set of structure constituents describing the structure of the data statement (e.g., stored in a “structure[ ]” object), and/or other user attributes.
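As one illustration of the programming code object option, the following Python sketch models a statement attribute record as a dataclass. The field names mirror the fields listed above in snake_case form; the field types, the sample values, and the shape of the entries in the structure list are assumptions chosen for the example and are not specified in the disclosure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class StatementAttributes:
    """One record per data statement (types are illustrative assumptions)."""
    statement_id: str
    user_id: str
    user_role: str
    client_id: str
    model_id: str
    time: datetime
    # Structure constituents such as clauses, predicates, and expressions.
    structure: list = field(default_factory=list)

# Hypothetical record for a statement issued over "sales_fact_table".
attrs = StatementAttributes(
    statement_id="s001",
    user_id="u42",
    user_role="analyst",
    client_id="c7",
    model_id="sales_model",
    time=datetime(2017, 12, 9, tzinfo=timezone.utc),
    structure=[
        {"clause": "FROM", "value": "sales_fact_table"},
        {"clause": "GROUP BY", "value": ["region"]},
    ],
)
```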

The data statement chunking scheme selection technique 400 also accesses a set of performance data to generate one or more performance estimates for the data statements (step 404). For example, a set of historical data operations performance statistics 462 and/or a set of historical data operations behavioral characteristics 464 from the performance data 354 earlier discussed might be used to form a performance predictive model that can be applied to the data statements to generate the performance estimates.

As can be observed in a set of select performance estimate metrics 456, the performance estimates generated for a particular data statement might include an estimate of the number of input rows addressed by the data statement (e.g., stored in an “eSizeIn” field), an estimate of the number of output rows resulting from execution of the data statement (e.g., stored in an “eSizeOut” field), an estimate of the cost of executing the data statement (e.g., stored in an “eCost” field), an estimate of the time to execute the data statement (e.g., stored in an “eTime” field), an estimate of the execution failure probability associated with the data statement (e.g., stored in an “eFailure” field), and/or other performance estimates. In certain embodiments, the performance estimates for a particular data statement can be stored and/or organized in a data structure that also includes the statement attributes corresponding to the data statement.
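The disclosure does not specify the form of the performance predictive model. As a deliberately naive stand-in, the Python sketch below averages historical observations from comparable statements to populate the five estimate fields. The history record layout, the function name, and the averaging approach are all assumptions made for illustration; a practical model would likely condition on the statement attributes as well.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PerformanceEstimates:
    e_size_in: float   # estimated number of input rows ("eSizeIn")
    e_size_out: float  # estimated number of output rows ("eSizeOut")
    e_cost: float      # estimated execution cost ("eCost")
    e_time: float      # estimated execution time ("eTime")
    e_failure: float   # estimated failure probability ("eFailure")

def estimate_from_history(history):
    """Average past observations; a placeholder for a real predictive model."""
    if not history:
        raise ValueError("no comparable history available")
    return PerformanceEstimates(
        e_size_in=mean(h["size_in"] for h in history),
        e_size_out=mean(h["size_out"] for h in history),
        e_cost=mean(h["cost"] for h in history),
        e_time=mean(h["time"] for h in history),
        # Failure probability is taken as the observed failure fraction.
        e_failure=mean(1.0 if h["failed"] else 0.0 for h in history),
    )

# Hypothetical history for statements resembling the one being planned.
history = [
    {"size_in": 1_000_000, "size_out": 200, "cost": 4.0, "time": 30.0, "failed": False},
    {"size_in": 3_000_000, "size_out": 400, "cost": 6.0, "time": 50.0, "failed": True},
]
estimates = estimate_from_history(history)
```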

The data statement chunking scheme selection technique 400 can continue with applying one or more statement chunking rules to the statement attributes and/or the performance estimates corresponding to the data statements (step 406). As an example, the select statement chunking rules 458 from the statement chunking rules 352 earlier discussed might be applied to the statement attributes and/or performance estimates pertaining to the representative data statement 452. The statement chunking rules are evaluated to determine whether to chunk or not chunk the data statements (decision 408). For example, rule “r01”, rule “r02”, and rule “r03” of the select statement chunking rules 458 respectively compare “eSizeOut”, “eCost”, and “eTime” to certain threshold values to determine whether to chunk the data statement. If the application of the statement chunking rules indicates no chunking is to be performed (e.g., the foregoing rule comparisons all evaluate to “false”) (see “No” path of decision 408), then the data statements are processed without chunking (step 410). If the application of the statement chunking rules indicates chunking is to be performed (e.g., at least one of the foregoing rule comparisons evaluates to “true”) (see “Yes” path of decision 408), then the data statement chunking scheme selection technique 400 continues with steps and/or operations for determining how to chunk the data statements.
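Decision 408 can be sketched in Python as a list of threshold comparisons in which any single rule evaluating to true triggers chunking. The rule identifiers and metric names follow the text; the threshold values, the greater-than comparison direction, and the dictionary representation of rules and estimates are assumptions made for illustration.

```python
import operator

# Threshold values are hypothetical; the disclosure does not state them.
RULES = [
    {"id": "r01", "metric": "eSizeOut", "op": operator.gt, "threshold": 1_000_000},
    {"id": "r02", "metric": "eCost", "op": operator.gt, "threshold": 100.0},
    {"id": "r03", "metric": "eTime", "op": operator.gt, "threshold": 60.0},
]

def triggered_rules(estimates, rules=RULES):
    """Return the identifiers of the rules whose comparison evaluates to true."""
    return [r["id"] for r in rules
            if r["op"](estimates[r["metric"]], r["threshold"])]

def should_chunk(estimates, rules=RULES):
    """Decision 408: chunk if at least one rule comparison is true."""
    return bool(triggered_rules(estimates, rules))

small = {"eSizeOut": 500, "eCost": 2.5, "eTime": 4.0}
large = {"eSizeOut": 5_000_000, "eCost": 40.0, "eTime": 300.0}
```

With these hypothetical thresholds, the small statement follows the “No” path and the large statement follows the “Yes” path because rules “r01” and “r03” are triggered.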

Referring to FIG. 4B, the data statement chunking scheme selection technique 400 receives statement attributes and/or performance estimates for data statements (e.g., representative data statement 452) identified for chunking (step 412). A set of expanded dataset metadata is accessed to identify one or more chunking parameters for chunking the data statements (step 414). For example, and as illustrated, the expanded dataset metadata 358 might be accessed to identify the chunking parameters. As earlier discussed, the expanded dataset metadata 358 is a rich set of metadata derived from certain information about the subject dataset that is associated with the data statements. In certain embodiments, the expanded dataset metadata 358 is available in a set of client-specific data that facilitates the herein disclosed techniques.

As can be observed in a set of representative expanded metadata 468, the expanded dataset metadata 358 might describe, for a particular subject dataset, a set of dimensions (e.g., stored in a “dimensions[ ]” object), a set of measures (e.g., stored in a “measures[ ]” object), a set of relationships (e.g., stored in a “relationships[ ]” object), a set of hierarchies (e.g., stored in a “hierarchies[ ]” object), a set of additivity characteristics (e.g., stored in an “additivity[ ]” object), a set of cardinality characteristics (e.g., stored in a “cardinality[ ]” object), a set of structure characteristics (e.g., stored in a “structure[ ]” object), and/or other metadata associated with the subject dataset. For example, the cardinality characteristics might include histograms, skews, count of empty or null fields, correlations to various attributes (e.g., keys, columns, etc.), and/or other characteristics. In the example shown in FIG. 4A, the expanded dataset metadata 358 is accessed to determine a set of identified chunking dimensions 472 (e.g., “gender”, “ageRange”, etc.) for chunking the data statements. In this case, the expanded dataset metadata 358 facilitates the identification of dimensions (e.g., “gender”, “ageRange”, etc.) for chunking that would otherwise not be known (e.g., merely from the data statement attributes).
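One way step 414 might use the cardinality characteristics is sketched below in Python. The disclosure does not state the criterion by which dimensions are identified, so the cutoff used here (a distinct-value count greater than one and no larger than a maximum chunk count) is purely an assumption, as are the function name, the metadata dictionary layout, the “customerID” dimension, and the cardinality figures.

```python
def identify_chunking_dimensions(metadata, max_chunks=16):
    """Keep dimensions whose distinct-value count yields a workable chunk count.

    The selection criterion is a hypothetical stand-in; dimensions with no
    recorded cardinality are skipped.
    """
    cardinality = metadata["cardinality"]
    return [dim for dim in metadata["dimensions"]
            if 1 < cardinality.get(dim, 0) <= max_chunks]

# Hypothetical slice of expanded dataset metadata for "sales_fact_table".
metadata = {
    "dimensions": ["gender", "ageRange", "customerID"],
    "cardinality": {"gender": 2, "ageRange": 10, "customerID": 500_000},
}
identified = identify_chunking_dimensions(metadata)
```

Under these assumed figures, “gender” and “ageRange” are retained while the high-cardinality “customerID” dimension is excluded.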

Using the chunking parameters (e.g., dimensions) identified for chunking the data statements, a set of candidate chunking schemes is determined (step 416). As shown, a set of candidate chunking schemes 474 based at least in part on the identified chunking dimensions 472 comprises a chunking scheme 326 ₂ that will chunk the data statements into “2X chunks by gender” and a chunking scheme 326 ₃ that will chunk the data statements into “10X chunks by ageRange”. Other chunking parameters and/or candidate chunking schemes are possible. The candidate chunking schemes are analyzed to estimate the performance of each scheme (step 418). For example, certain performance estimates (e.g., cost, time, failure rate, etc.) can be generated for the candidate chunking schemes 474. One of the candidate chunking schemes is then selected based at least in part on the performance estimates of the schemes (step 420). The chunking scheme 326 ₂ (e.g., “2X chunks by gender”) might be selected as the selected chunking scheme 478 since the estimated resource cost to execute two data statement chunks over the subject dataset is less than the estimated resource cost to execute 10 data statement chunks over the subject dataset as specified by chunking scheme 326 ₃ (e.g., “10X chunks by ageRange”).
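
Steps 416 through 420 can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the CandidateScheme class, the estimate_cost function, and the cost model itself (a fixed per-chunk overhead plus an even share of the rows scanned) are assumptions introduced only to show how a lowest-cost candidate might be selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateScheme:
    name: str         # e.g., "2X chunks by gender"
    dimension: str    # chunking dimension the scheme is based on
    chunk_count: int  # number of data statement chunks the scheme produces


def estimate_cost(scheme, rows_scanned, per_chunk_overhead):
    """Assumed cost model: each chunk pays a fixed overhead plus its share of the scan."""
    scan_per_chunk = rows_scanned / scheme.chunk_count
    return scheme.chunk_count * (per_chunk_overhead + scan_per_chunk)


def select_scheme(candidates, rows_scanned=1_000_000, per_chunk_overhead=50_000):
    """Estimate each candidate (step 418) and pick the cheapest one (step 420)."""
    estimates = {
        c.name: estimate_cost(c, rows_scanned, per_chunk_overhead) for c in candidates
    }
    selected = min(candidates, key=lambda c: estimates[c.name])
    return selected, estimates


candidates = [
    CandidateScheme("2X chunks by gender", "gender", 2),
    CandidateScheme("10X chunks by ageRange", "ageRange", 10),
]
selected, estimates = select_scheme(candidates)
print(selected.name, estimates)
```

With these hypothetical inputs the two-chunk scheme has the lower estimate, which matches the selection described above. A production estimator would presumably draw on performance data (e.g., historical statistics) rather than a fixed formula, and could weigh time or failure rate as well as cost.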

Further details pertaining to generating data operations for a particular chunking scheme according to the herein disclosed techniques are shown and described as pertains to FIG. 5A and FIG. 5B.

FIG. 5A and FIG. 5B present a data operations generation technique 500 as implemented in systems that facilitate rule-based chunking of data statements for operation over large datasets in data storage environments. As an option, one or more variations of data operations generation technique 500 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The data operations generation technique 500 or any aspect thereof may be implemented in any environment.

The data operations generation technique 500 presents one embodiment of certain steps and/or operations that generate data operations to carry out data statements that are chunked according to the herein disclosed techniques. Example pseudo-code is also presented to illustrate the data operations generation technique 500.

As shown in FIG. 5A, the data operations generation technique 500 can commence by receiving statement attributes for a data statement identified for chunking in accordance with a certain chunking scheme (step 502). One or more data operations are generated to carry out the data statement in accordance with the chunking scheme (step 504). One or more execution directives are determined for the data operations (step 506) to facilitate execution of the data operations (step 508). For example, the execution directives might be consulted when scheduling one or more of the data operations for execution at a query engine associated with the subject dataset. Specifically, the execution directives might indicate that certain data operations are to be executed in parallel, in sequence, asynchronously, or synchronously. The execution directives might also indicate where certain partial results (e.g., from each of the chunks) are to be stored. As shown, the data operations and the execution directives might comprise an instance of a statement chunking plan 328.
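
One way to picture a statement chunking plan is as a set of named data operations paired with directives that say how they are scheduled. The Python sketch below is an assumption-laden illustration: the Stage and StatementChunkingPlan classes, the "parallel" and "sequence" mode strings, the operation identifiers, and the use of a thread pool are all hypothetical choices rather than structures taken from the disclosure, and the data operations are stand-in callables instead of queries issued to a query engine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    op_ids: List[str]  # data operations scheduled together
    mode: str          # "parallel" or "sequence" (assumed directive vocabulary)


@dataclass
class StatementChunkingPlan:
    operations: Dict[str, Callable[[], object]]        # op id -> data operation
    stages: List[Stage] = field(default_factory=list)  # execution directives


def execute_plan(plan):
    """Run stages in order; operations within a parallel stage run concurrently."""
    results = {}
    for stage in plan.stages:
        if stage.mode == "parallel":
            with ThreadPoolExecutor(max_workers=len(stage.op_ids)) as pool:
                futures = {
                    op_id: pool.submit(plan.operations[op_id]) for op_id in stage.op_ids
                }
                for op_id, future in futures.items():
                    results[op_id] = future.result()
        elif stage.mode == "sequence":
            for op_id in stage.op_ids:
                results[op_id] = plan.operations[op_id]()
        else:
            raise ValueError(f"unknown execution mode: {stage.mode}")
    return results


# Stand-in operations: two chunks store partial results, then a merge combines them.
partials = {}
plan = StatementChunkingPlan(
    operations={
        "chunkOp1": lambda: partials.setdefault("first", 3),
        "chunkOp2": lambda: partials.setdefault("second", 4),
        "mergeOp": lambda: partials["first"] + partials["second"],
    },
    stages=[Stage(["chunkOp1", "chunkOp2"], "parallel"), Stage(["mergeOp"], "sequence")],
)
results = execute_plan(plan)
print(results["mergeOp"])
```

Keeping the directives separate from the operations means the same generated operations could be rescheduled (for example, run in sequence against a heavily loaded query engine) without regenerating them.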

Referring to FIG. 5B, an example set of representative data operations 520 generated according to the data operations generation technique 500 and/or other herein disclosed techniques is shown. As can be observed, the representative data operations 520 represent data operations generated to carry out chunking of the representative data statement 452 in accordance with the chunking scheme 326 ₂ (e.g., “2X chunks by gender”). Specifically, the representative data operations 520 comprise two chunk operations, chunk operation 522 ₁ and chunk operation 522 ₂, that correspond to the representative data statement 452 as chunked “by gender”.

As shown, chunk operation 522 ₁ creates a “chunk_m” table corresponding to “customer_gender=‘male’” and chunk operation 522 ₂ creates a “chunk_f” table corresponding to “customer_gender=‘female’”. A merge operation 524 is generated to merge the results from the “chunk_m” table and the “chunk_f” table. In some cases, such a merge operation can be performed at a client agent (e.g., BI application) associated with the client issuing the data statement, while in other cases the merge operation can be performed at an execution agent in a client data statement processing layer. The results of each chunk operation might also be streamed to a client agent as the results become available. For example, results might be streamed to a client agent when no subsequent aggregation is to be performed, and/or when the client specifies the chunking dimension (e.g., “gender”).

A set of cleanup operations, cleanup operation 526 ₁ and cleanup operation 526 ₂, are generated to drop the “chunk_m” table and the “chunk_f” table. A set of execution directives 528 are also generated to facilitate the execution of the data operations. Specifically, execution directives 528 indicate the chunk operation 522 ₁ (e.g., identified as “chunkOp1”) and chunk operation 522 ₂ (e.g., identified as “chunkOp2”) are to be executed in parallel with results stored at database “dbX”. The merge operation 524 (e.g., identified as “mergeOp”) is to be executed in sequence with the chunk operations on tables stored in database “dbX”. The execution directives 528 further specify that cleanup operation 526 ₁ (e.g., identified as “cleanOp1”) and cleanup operation 526 ₂ (e.g., identified as “cleanOp2”) are to be executed in parallel in response to the completion of merge operation 524. The result set produced by the merge operation 524 can then be presented to the client that issued the representative data statement 452.
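
The chunk, merge, and cleanup operations described above can be approximated with the following Python sketch, which uses an in-memory SQLite database in place of database “dbX”. The “sales” table, its columns other than “customer_gender”, the sample rows, and the aggregation query standing in for the representative data statement 452 are assumptions made for illustration; only the “chunk_m” and “chunk_f” table names and the gender filters come from the description. For simplicity the chunk operations run one after another here, whereas the execution directives call for them to run in parallel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for database "dbX"
conn.execute(
    "CREATE TABLE sales (customer_state TEXT, customer_gender TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("CA", "male", 10),
        ("CA", "female", 20),
        ("NY", "male", 5),
        ("NY", "female", 7),
        ("CA", "male", 1),
    ],
)

# Hypothetical unchunked statement, used only to check the merged result.
original = (
    "SELECT customer_state, SUM(amount) FROM sales "
    "GROUP BY customer_state ORDER BY customer_state"
)


def chunk_op(table, gender):
    """Build a chunk operation that materializes one gender's partial aggregate."""
    return (
        f"CREATE TABLE {table} AS "
        f"SELECT customer_state, SUM(amount) AS total FROM sales "
        f"WHERE customer_gender = '{gender}' GROUP BY customer_state"
    )


chunk_ops = [chunk_op("chunk_m", "male"), chunk_op("chunk_f", "female")]
merge_op = (
    "SELECT customer_state, SUM(total) FROM "
    "(SELECT * FROM chunk_m UNION ALL SELECT * FROM chunk_f) "
    "GROUP BY customer_state ORDER BY customer_state"
)
cleanup_ops = ["DROP TABLE chunk_m", "DROP TABLE chunk_f"]

for sql in chunk_ops:      # chunkOp1, chunkOp2
    conn.execute(sql)
merged = conn.execute(merge_op).fetchall()  # mergeOp
for sql in cleanup_ops:    # cleanOp1, cleanOp2
    conn.execute(sql)

expected = conn.execute(original).fetchall()
print(merged == expected)
```

Summing the partial sums is valid here because SUM is an additive measure. A non-additive measure (e.g., a distinct count) would generally need a different merge operation, which is one reason additivity characteristics in the expanded dataset metadata may matter when generating the data operations.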

Additional Embodiments of the Disclosure

Additional Practical Application Examples

FIG. 6 depicts a system 600 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. This and other embodiments present particular arrangements of elements that, individually and/or as combined, serve to form improved technological processes that address reducing the client-specific costs and/or failure rates associated with data operations that are performed over large datasets. The partitioning of system 600 is merely illustrative and other partitions are possible. As an option, the system 600 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 600 or any operation therein may be carried out in any desired environment. The system 600 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 605, and any operation can communicate with other operations over communication path 605. The modules of the system can, individually or in combination, perform method operations within system 600. Any operations performed within system 600 may be performed in any order unless as may be specified in the claims.

The shown embodiment implements a portion of a computer system, presented as system 600, comprising one or more computer processors to execute a set of program code instructions (module 610) and modules for accessing memory to hold program code instructions to perform: receiving one or more data statements issued by at least one client, the data statements issued by the client to operate over a subject dataset (module 620); applying at least a portion of a set of client-specific data to the data statements to determine at least one chunking scheme (module 630); generating one or more data operations from the data statements, the data operations generated based at least in part on the chunking scheme (module 640); and executing the data operations over the subject dataset to generate a result set (module 650).

Variations of the foregoing may include more or fewer of the shown modules. Certain variations may perform more or fewer (or different) steps, and/or certain variations may use data elements in more, or in fewer (or different) operations.

System Architecture Overview

Additional System Architecture Examples

FIG. 7A depicts a block diagram of an instance of a computer system 7A00 suitable for implementing embodiments of the present disclosure. Computer system 7A00 includes a bus 706 or other communication mechanism for communicating information. The bus interconnects subsystems and devices such as a CPU, or a multi-core CPU (e.g., data processor 707), a system memory (e.g., main memory 708, or an area of random access memory (RAM)), a non-volatile storage device or non-volatile storage area (e.g., read-only memory or ROM 709), an internal storage device 710 or external storage device 713 (e.g., magnetic or optical), a data interface 733, and a communications interface 714 (e.g., PHY, MAC, Ethernet interface, modem, etc.). The aforementioned components are shown within processing element partition 701; however, other partitions are possible. The shown computer system 7A00 further comprises a display 711 (e.g., CRT or LCD), various input devices 712 (e.g., keyboard, cursor control), and an external data repository 731.

According to an embodiment of the disclosure, computer system 7A00 performs specific operations by data processor 707 executing one or more sequences of one or more program code instructions contained in a memory. Such instructions (e.g., program instructions 702 ₁, program instructions 702 ₂, program instructions 702 ₃, etc.) can be contained in or can be read into a storage location or memory from any computer readable/usable medium such as a static storage device or a disk drive. The sequences can be organized to be accessed by one or more processing entities configured to execute a single process or configured to execute multiple concurrent processes to perform work. A processing entity can be hardware-based (e.g., involving one or more cores) or software-based, and/or can be formed using a combination of hardware and software that implements logic, and/or can carry out computations and/or processing steps using one or more processes and/or one or more tasks and/or one or more threads or any combination thereof.

According to an embodiment of the disclosure, computer system 7A00 performs specific networking operations using one or more instances of communications interface 714. Instances of communications interface 714 may comprise one or more networking ports that are configurable (e.g., pertaining to speed, protocol, physical layer characteristics, media access characteristics, etc.) and any particular instance of communications interface 714 or port thereto can be configured differently from any other particular instance. Portions of a communication protocol can be carried out in whole or in part by any instance of communications interface 714, and data (e.g., packets, data structures, bit fields, etc.) can be positioned in storage locations within communications interface 714, or within system memory, and such data can be accessed (e.g., using random access addressing, or using direct memory access DMA, etc.) by devices such as data processor 707.

Communications link 715 can be configured to transmit (e.g., send, receive, signal, etc.) any types of communications packets (e.g., communications packet 738 ₁, communications packet 738 _(N)) comprising any organization of data items. The data items can comprise a payload data area 737, a destination address 736 (e.g., a destination IP address), a source address 735 (e.g., a source IP address), and can include various encodings or formatting of bit fields to populate packet characteristics 734. In some cases, the packet characteristics include a version identifier, a packet or payload length, a traffic class, a flow label, etc. In some cases, payload data area 737 comprises a data structure that is encoded and/or formatted to fit into byte or word boundaries of the packet.

In some embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement aspects of the disclosure. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and/or software. In embodiments, the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the disclosure.

The term “computer readable medium” or “computer usable medium” as used herein refers to any medium that participates in providing instructions to data processor 707 for execution. Such a medium may take many forms including, but not limited to, non-volatile media and volatile media. Non-volatile media includes, for example, optical or magnetic disks such as disk drives or tape drives. Volatile media includes dynamic memory such as RAM.

Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, or any other magnetic medium; CD-ROM or any other optical medium; punch cards, paper tape, or any other physical medium with patterns of holes; RAM, PROM, EPROM, FLASH-EPROM, or any other memory chip or cartridge, or any other non-transitory computer readable medium. Such data can be stored, for example, in any form of external data repository 731, which in turn can be formatted into any one or more storage areas, and which can comprise parameterized storage 739 accessible by a key (e.g., filename, table name, block address, offset address, etc.).

Execution of the sequences of instructions to practice certain embodiments of the disclosure is performed by a single instance of computer system 7A00. According to certain embodiments of the disclosure, two or more instances of computer system 7A00 coupled by a communications link 715 (e.g., LAN, PSTN, or wireless network) may perform the sequence of instructions required to practice embodiments of the disclosure using two or more instances of components of computer system 7A00.

Computer system 7A00 may transmit and receive messages such as data and/or instructions organized into a data structure (e.g., communications packets). The data structure can include program instructions (e.g., application code 703), communicated through communications link 715 and communications interface 714. Received program code may be executed by data processor 707 as it is received and/or stored in the shown storage device or in or upon any other non-volatile storage for later execution. Computer system 7A00 may communicate through a data interface 733 to a database 732 on an external data repository 731. Data items in a database can be accessed using a primary key (e.g., a relational database primary key).

Processing element partition 701 is merely one sample partition. Other partitions can include multiple data processors, and/or multiple communications interfaces, and/or multiple storage devices, etc. within a partition. For example, a partition can bound a multi-core processor (e.g., possibly including embedded or co-located memory), or a partition can bound a computing cluster having a plurality of computing elements, any of which computing elements are connected directly or indirectly to a communications link. A first partition can be configured to communicate to a second partition. A particular first partition and particular second partition can be congruent (e.g., in a processing element array) or can be different (e.g., comprising disjoint sets of components).

A module as used herein can be implemented using any mix of any portions of the system memory and any extent of hard-wired circuitry including hard-wired circuitry embodied as a data processor 707. Some embodiments include one or more special-purpose hardware components (e.g., power control, logic, sensors, transducers, etc.). A module may include one or more state machines and/or combinational logic used to implement or facilitate the operational and/or performance characteristics pertaining to data access authorization for dynamically generated database structures.

Various implementations of the database 732 comprise storage media organized to hold a series of records or files such that individual records or files are accessed using a name or key (e.g., a primary key or a combination of keys and/or query clauses). Such files or records can be organized into one or more data structures (e.g., data structures used to implement or facilitate aspects of data access authorization for dynamically generated database structures). Such files or records can be brought into and/or stored in volatile or non-volatile memory.

FIG. 7B depicts a block diagram of an instance of a distributed data processing system 7B00 that may be included in a system implementing instances of the herein-disclosed embodiments.

Distributed data processing system 7B00 can include many more or fewer components than those shown. Distributed data processing system 7B00 can be used to store data, perform computational tasks, and/or transmit data between a plurality of data centers 740 (e.g., data center 740 ₁, data center 740 ₂, data center 740 ₃, and data center 740 ₄). Distributed data processing system 7B00 can include any number of data centers. Some of the plurality of data centers 740 might be located geographically close to each other, while others might be located far from the other data centers.

The components of distributed data processing system 7B00 can communicate using dedicated optical links and/or other dedicated communication channels, and/or supporting hardware such as modems, bridges, routers, switches, wireless antennas, wireless towers, and/or other hardware components. In some embodiments, the component interconnections of distributed data processing system 7B00 can include one or more wide area networks (WANs), one or more local area networks (LANs), and/or any combination of the foregoing networks. In certain embodiments, the component interconnections of distributed data processing system 7B00 can comprise a private network designed and/or operated for use by a particular enterprise, company, customer, and/or other entity. In other embodiments, a public network might comprise a portion or all of the component interconnections of distributed data processing system 7B00.

In some embodiments, each data center can include multiple racks that each include frames and/or cabinets into which computing devices can be mounted. For example, as shown, data center 740 ₁ can include a plurality of racks (e.g., rack 744 ₁, ..., rack 744 _(N)), each comprising one or more computing devices. More specifically, rack 744 ₁ can include a first plurality of CPUs (e.g., CPU 746 ₁₁, CPU 746 ₁₂, ..., CPU 746 _(1M)), and rack 744 _(N) can include an Nth plurality of CPUs (e.g., CPU 746 _(N1), CPU 746 _(N2), ..., CPU 746 _(NM)). The plurality of CPUs can include data processors, network attached storage devices, and/or other computer controlled devices. In some embodiments, at least one of the plurality of CPUs can operate as a master processor, controlling certain aspects of the tasks performed throughout the distributed data processing system 7B00. For example, such master processor control functions might pertain to scheduling, data distribution, and/or other processing operations associated with the tasks performed throughout the distributed data processing system 7B00. In some embodiments, one or more of the plurality of CPUs may take on one or more roles, such as a master and/or a slave. One or more of the plurality of racks can further include storage (e.g., one or more network attached disks) that can be shared by one or more of the CPUs.

In some embodiments, the CPUs within a respective rack can be interconnected by a rack switch. For example, the CPUs in rack 744 ₁ can be interconnected by a rack switch 745 ₁. As another example, the CPUs in rack 744 _(N) can be interconnected by a rack switch 745 _(N). Further, the plurality of racks within data center 740 ₁ can be interconnected by a data center switch 742. Distributed data processing system 7B00 can be implemented using other arrangements and/or partitioning of multiple interconnected processors, racks, and/or switches. For example, in some embodiments, the plurality of CPUs can be replaced by a single large-scale multiprocessor.

In the foregoing specification, the disclosure has been described with reference to specific embodiments thereof. It will however be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. For example, the above-described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the disclosure. The specification and drawings are to be regarded in an illustrative sense rather than in a restrictive sense.

What is claimed is:
 1. A method for chunking data statements based at least in part on a set of client-specific information in a client data statement processing layer, the method comprising: receiving one or more data statements issued by at least one client, the data statements issued by the client to operate over a subject dataset; applying at least a portion of a set of client-specific data to the data statements to determine at least one chunking scheme; accessing performance data to generate performance estimates for a set of candidate chunking schemes from the at least one chunking scheme; selecting a chunking scheme from the set of candidate chunking schemes based on the performance estimates; generating one or more data operations from the data statements, the data operations generated based at least in part on the chunking scheme and on the performance estimates; and executing the data operations over the subject dataset to generate a result set.
 2. The method of claim 1, wherein the client-specific data comprises at least one of, one or more statement chunking rules, a set of enhanced dataset metadata, or a set of performance data.
 3. The method of claim 1, wherein the data operations are executed at one or more query engines, and wherein the client-specific data is inaccessible by the query engines.
 4. The method of claim 1, further comprising: receiving a set of dataset metadata associated with the subject dataset; expanding the dataset metadata into a set of expanded dataset metadata; and consulting the expanded dataset metadata to perform at least one of, determining the at least one chunking scheme, or generating the one or more data operations.
 5. The method of claim 1, further comprising: analyzing the data statements to determine one or more statement attributes associated with the data statements; and applying the portion of the client-specific data to at least one of the statement attributes to perform at least one of, determining the at least one chunking scheme, or generating the one or more data operations.
 6. The method of claim 1, further comprising: generating one or more performance estimates associated with the data statements; and applying the portion of the client-specific data to at least one of the performance estimates to perform at least one of, determining the at least one chunking scheme, or generating the one or more data operations.
 7. The method of claim 6, wherein at least one of the performance estimates is based at least in part on a set of performance data.
 8. The method of claim 7, wherein the performance data comprises at least one of, a set of historical data operations performance statistics, or a set of historical data operations behavioral characteristics.
 9. The method of claim 1, further comprising merging two or more results from the data operations into the result set.
 10. The method of claim 1, wherein the data operations are executed in accordance with one or more execution directives, the execution directives indicating that one or more of the data operations be executed in parallel, in sequence, asynchronously, or synchronously.
 11. A computer readable medium, embodied in a non-transitory computer readable medium, the non-transitory computer readable medium having stored thereon a sequence of instructions which, when stored in memory and executed by one or more processors causes the one or more processors to perform a set of acts for chunking data statements based at least in part on a set of client-specific information in a client data statement processing layer, the acts comprising: receiving one or more data statements issued by at least one client, the data statements issued by the client to operate over a subject dataset; applying at least a portion of a set of client-specific data to the data statements to determine at least one chunking scheme; accessing performance data to generate performance estimates for a set of candidate chunking schemes from the at least one chunking scheme; selecting a chunking scheme from the set of candidate chunking schemes based on the performance estimates; generating one or more data operations from the data statements, the data operations generated based at least in part on the chunking scheme and on the performance estimates; and executing the data operations over the subject dataset to generate a result set.
 12. The computer readable medium of claim 11, wherein the client-specific data comprises at least one of, one or more statement chunking rules, a set of enhanced dataset metadata, or a set of performance data.
 13. The computer readable medium of claim 11, wherein the data operations are executed at one or more query engines, and wherein the client-specific data is inaccessible by the query engines.
 14. The computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of: receiving a set of dataset metadata associated with the subject dataset; expanding the dataset metadata into a set of expanded dataset metadata; and consulting the expanded dataset metadata to perform at least one of, determining the at least one chunking scheme, or generating the one or more data operations.
 15. The computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of: analyzing the data statements to determine one or more statement attributes associated with the data statements; and applying the portion of the client-specific data to at least one of the statement attributes to perform at least one of, determining the at least one chunking scheme, or generating the one or more data operations.
 16. The computer readable medium of claim 11, further comprising instructions which, when stored in memory and executed by the one or more processors causes the one or more processors to perform acts of: generating one or more performance estimates associated with the data statements; and applying the portion of the client-specific data to at least one of the performance estimates to perform at least one of, determining the at least one chunking scheme, or generating the one or more data operations.
 17. The computer readable medium of claim 16, wherein at least one of the performance estimates is based at least in part on a set of performance data.
 18. The computer readable medium of claim 17, wherein the performance data comprises at least one of, a set of historical data operations performance statistics, or a set of historical data operations behavioral characteristics.
 19. A system for chunking data statements based at least in part on a set of client-specific information in a client data statement processing layer, the system comprising: a storage medium having stored thereon a sequence of instructions; and one or more processors that execute the instructions to cause the one or more processors to perform a set of acts, the acts comprising: receiving one or more data statements issued by at least one client, the data statements issued by the client to operate over a subject dataset; applying at least a portion of a set of client-specific data to the data statements to determine at least one chunking scheme; accessing performance data to generate performance estimates for a set of candidate chunking schemes from the at least one chunking scheme; selecting a chunking scheme from the set of candidate chunking schemes based on the performance estimates; generating one or more data operations from the data statements, the data operations generated based at least in part on the chunking scheme and on the performance estimates; and executing the data operations over the subject dataset to generate a result set.
 20. The system of claim 19, wherein the client-specific data comprises at least one of, one or more statement chunking rules, a set of enhanced dataset metadata, or a set of performance data.